bcdata = read_csv("bcdata_Assignment1.csv") |> janitor::clean_names()
summary = skimr::skim(bcdata) |>
select(skim_variable, numeric.mean, numeric.p50, numeric.p0, numeric.p100) |> knitr::kable(digits = 2) |> print()
##
##
## |skim_variable | numeric.mean| numeric.p50| numeric.p0| numeric.p100|
## |:--------------|------------:|-----------:|----------:|------------:|
## |age | 57.30| 56.00| 24.00| 89.00|
## |bmi | 27.58| 27.66| 18.37| 38.58|
## |glucose | 97.79| 92.00| 60.00| 201.00|
## |insulin | 10.01| 5.92| 2.43| 58.46|
## |homa | 2.69| 1.38| 0.47| 25.05|
## |leptin | 26.62| 20.27| 4.31| 90.28|
## |adiponectin | 10.18| 8.35| 1.66| 38.04|
## |resistin | 14.73| 10.83| 3.21| 82.10|
## |mcp_1 | 534.65| 471.32| 45.84| 1698.44|
## |classification | 1.55| 2.00| 1.00| 2.00|
# There's no missing in this dataset.
q2 = bcdata |> mutate(who_bmi =
ifelse(bmi < 16.5, "Severely underweight",
ifelse(16.5 <= bmi & bmi < 18.5, "Underweight",
ifelse(18.5 <= bmi & bmi < 25, "Normal weight",
ifelse(25 <= bmi & bmi < 30, "Overweight",
ifelse(30 <= bmi & bmi < 35, "Obesity class I",
ifelse(35 <= bmi & bmi < 40, "Obesity class II","Obesity class III"))))))) |>
mutate(classification = recode(classification, "1" = "Healthy Controls", "2" = "Breast Cancer Patients")) |> arrange(bmi)
#check if I have recoded bmi correctly
table(q2$bmi, q2$who_bmi)
#Since there's no healthy controls in underweight category, I added a category "underweight healthhy controls" with 0 count to make sure every column has the same width.
freq_table = q2 |> group_by(who_bmi, classification) |>
summarise(n = n()) |> mutate(proportion = n / sum(n) * 100)
supp = data.frame(who_bmi = "Underweight",
classification = "Healthy Controls",
n = 0, proportion = 0)
final_freq = bind_rows(supp, freq_table) |> mutate(who_bmi = as.factor(who_bmi))
level = c("Severely underweight", "Underweight", "Normal weight", "Overweight", "Obesity class I", "Obesity class II", "Obesity class III")
final_freq |>
mutate(who_bmi = forcats::fct_relevel(who_bmi, level),
text_label = str_c(proportion, "%")) |>
plot_ly(x = ~who_bmi, y = ~proportion, color = ~classification, type = "bar", colors = "viridis", text = ~text_label) |>
layout(title = 'Barplot showing the proportion of breast cancer cases and controls by WHO BMI category', xaxis = list(title = 'WHO BMI Category'), yaxis = list(title = 'Proportion(%)'), barmode = "stack")
The proportion of each BMI category are cases and the proportion of each BMI category are controls are clearly showed on the stacked proportional bar chart,if you put the mouse on the bar.
But I think a barchart showing porportion of the whole sample within each category.(each category doesn’t accounted for 1) is more informative.
q4 = bcdata |>
mutate(classification = recode(classification, "1" = 0, "2" = 1))
# recode 1-healthy controls as "0", 2-breast cancer patients as "1"
table(q4$classification, bcdata$classification)
##
## 1 2
## 0 52 0
## 1 0 64
# recoded correctly
mylogit = glm(classification ~ glucose + homa + leptin + bmi + age, data = q4, family = "binomial")
mylogit |> broom::tidy() |> knitr::kable(digits = 2)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -3.63 | 2.36 | -1.54 | 0.12 |
| glucose | 0.08 | 0.02 | 3.47 | 0.00 |
| homa | 0.27 | 0.17 | 1.59 | 0.11 |
| leptin | -0.01 | 0.02 | -0.54 | 0.59 |
| bmi | -0.10 | 0.06 | -1.84 | 0.07 |
| age | -0.02 | 0.01 | -1.59 | 0.11 |
cbind(estimate = coef(mylogit), confint(mylogit)) |> knitr::kable(digits = 2)
| estimate | 2.5 % | 97.5 % | |
|---|---|---|---|
| (Intercept) | -3.63 | -8.54 | 0.75 |
| glucose | 0.08 | 0.04 | 0.13 |
| homa | 0.27 | -0.03 | 0.65 |
| leptin | -0.01 | -0.04 | 0.02 |
| bmi | -0.10 | -0.22 | 0.00 |
| age | -0.02 | -0.05 | 0.00 |
exp(cbind(OR = coef(mylogit), confint(mylogit))) |> knitr::kable(digits = 2)
| OR | 2.5 % | 97.5 % | |
|---|---|---|---|
| (Intercept) | 0.03 | 0.00 | 2.13 |
| glucose | 1.09 | 1.04 | 1.14 |
| homa | 1.32 | 0.97 | 1.92 |
| leptin | 0.99 | 0.96 | 1.02 |
| bmi | 0.90 | 0.80 | 1.00 |
| age | 0.98 | 0.95 | 1.00 |
Estimate: for HOMA-IR, the beta estimate is 0.27,and 95% confidence interval is (-0.03, 0.65).
OR: for HOMA-IR, the OR is 1.32,and 95% confidence interval is (0.97, 1.92).
Interpretation: The odds of developing breast cancer increase by 1.32 unit with 1-unit increase in HOMA-IR. I am 95% confident that the true OR lies between 0.97 and 1.92. p-value = 0.11 > 0.05.We cannot reject the null hypothesis that the HOMA-IR level has no association with breast cancer at a 5% level of significance. We have insufficient evidence to support that HOMA-IR is significantly associated with breast cancer, adjusting for glucose, leptin, bmi and age.
q5 = bcdata
fit = lm(insulin ~ bmi + age + glucose, data = q5)
fit |> broom::tidy() |> knitr::kable(digits = 2)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -13.50 | 5.86 | -2.30 | 0.02 |
| bmi | 0.15 | 0.16 | 0.91 | 0.36 |
| age | -0.05 | 0.05 | -1.04 | 0.30 |
| glucose | 0.23 | 0.04 | 6.13 | 0.00 |
cbind(estimate = coef(fit), confint(fit)) |> knitr::kable(digits = 2)
| estimate | 2.5 % | 97.5 % | |
|---|---|---|---|
| (Intercept) | -13.50 | -25.11 | -1.89 |
| bmi | 0.15 | -0.17 | 0.47 |
| age | -0.05 | -0.16 | 0.05 |
| glucose | 0.23 | 0.16 | 0.30 |
Estimate: for age, the beta estimate is -0.05,and 95% confidence interval is (-0.16, 0.05).
Interpretation: The level of insulin decreases by 0.05 μU/mL with one-year increase in age, adjusting for bmi and glucose.I am 95% confident that the true change lies between -0.16 and 0.05 μU/mL. p-value = 0.30 > 0.05.We cannot reject the null hypothesis that age has no association with insulin level at a 5% level of significance. We have insufficient evidence to support that age is significantly associated with insulin level, adjusting for bmi and glucose.